Compact trie clustering for overlap detection in news

Authors

  • Richard Elling Moe
  • Dag Elgesem
Abstract

We investigate document clustering by adapting Zamir and Etzioni’s approach to online web document clustering. Specifically, we generalize the Suffix Tree Clustering method to allow for a wider range of clustering techniques. We apply the modified technique to a corpus of news articles, improving precision by 29% while running 8% faster than the original algorithm.


Similar resources

A Genetic Approach to Tuning Compact Trie Clustering

The Compact Trie method for document clustering is sensitive to the kind of text it is applied to, but contains various parameters that may be tuned for adaptation to specific applications. We implement a genetic algorithm for optimizing these parameters and apply it to a corpus of texts to demonstrate the feasibility of using genetic algorithms for tuning.


Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth

Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...


Trie-Based Data Structures for Sequence Assembly

We investigate the application of trie-based data structures, suffix trees and suffix arrays in the problem of overlap detection in fragment assembly. Both data structures are theoretically and experimentally analyzed on speed and space. By using heuristics, we can greatly reduce the calls to the time-consuming dynamic programming, and have improved the speed of overlap detection up to 1,000 times ...


Correlation Clustering for Crosslingual Link Detection

The crosslingual link detection problem calls for identifying news articles in multiple languages that report on the same news event. This paper presents a novel approach based on constrained clustering. We discuss a general way for constrained clustering using a recent, graph-based clustering framework called correlation clustering. We introduce a correlation clustering implementation that fea...


Compact Balanced Tries

[Summary by Mireille Régnier] Classical B-trees and prefix B-trees [1] offer both fast, direct addressing and easy sequential processing. They are balanced, segmented, and flexible. Flexibility means that a B-tree leaf splitting may be done at any position inside the leaf. This property is emphasised: one generates and suppresses empty leaves, while forcing the other leaves to a 100% storage utilisa...




Publication date: 2013